@bingquanzhao
Contributor

@bingquanzhao bingquanzhao commented May 28, 2025

Summary

This PR introduces a new Apache Doris sink for Vector, enabling users to send log data directly to Apache Doris databases using the Stream Load API. The implementation includes:

  • Complete Doris sink implementation with Stream Load API integration
  • Comprehensive configuration options (endpoints, authentication, batching, custom headers)
  • Full documentation generation using CUE
  • Health check functionality with proper error handling
  • Support for Doris-specific Stream Load parameters via custom HTTP headers

Apache Doris is a modern MPP analytical database that provides sub-second query response times on large datasets, making it ideal for real-time data warehouses and log analysis scenarios.

Change Type

  • New feature
  • Bug fix
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

Local Testing

  1. Unit Tests: All unit tests pass with cargo test
  2. Configuration Validation: Verified config parsing with vector validate
  3. Documentation Generation: Successfully generated docs with make generate-component-docs
  4. CUE Validation: All CUE files pass format and validation checks
  5. Changelog Validation: Changelog fragment passes validation with ./scripts/check_changelog_fragments.sh

Test Configuration Used

sources:
  demo:
    type: demo_logs
    format: json
    interval: 1

sinks:
  doris:
    type: doris
    inputs: ["demo"]
    
    # Target configuration
    endpoints: 
      - "http://doris-fe1:8030"
      - "http://doris-fe2:8030"
    database: "analytics_db"
    table: "user_events"
    
    # Authentication configuration
    auth:
      strategy: basic
      user: "admin"
      password: "admin123"
    
    # Batch configuration
    batch:
      max_events: 100000        # Maximum events per batch
      timeout_secs: 30          # Batch timeout in seconds
      max_bytes: 1073741824     # Maximum bytes per batch (1 GiB)
    
    # Custom HTTP headers for Doris Stream Load
    headers:
      format: "json"
      strip_outer_array: "false"
      read_json_by_line: "true"
    
    # Additional configuration
    label_prefix: "vector"
    log_request: true
    log_progress_interval: 10
    buffer_bound: 1
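
For readers unfamiliar with Stream Load, the configuration above roughly maps onto one HTTP PUT per batch. The following is a minimal Python sketch of that mapping, not the sink's Rust code: `StreamLoadRequest` and `build_stream_load_request` are hypothetical names, and the `/api/{database}/{table}/_stream_load` path is the one documented for Doris Stream Load. The sketch only builds the request; nothing is sent.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class StreamLoadRequest:
    """Hypothetical container for the parts of one Stream Load call."""
    method: str
    url: str
    headers: dict
    body: bytes


def build_stream_load_request(endpoint, database, table, user, password,
                              custom_headers, events):
    """Build (but do not send) a Stream Load request for one batch."""
    url = f"{endpoint}/api/{database}/{table}/_stream_load"

    # With read_json_by_line set to "true", the body is one JSON object per line.
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")

    # Basic authentication: base64("user:password").
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": f"Basic {token}"}

    # Doris-specific load parameters travel as plain HTTP headers.
    headers.update(custom_headers)

    return StreamLoadRequest(method="PUT", url=url, headers=headers, body=body)


request = build_stream_load_request(
    endpoint="http://doris-fe1:8030",
    database="analytics_db",
    table="user_events",
    user="admin",
    password="admin123",
    custom_headers={"format": "json", "read_json_by_line": "true"},
    events=[{"message": "hello"}, {"message": "world"}],
)
print(request.method, request.url)
```

Keeping the load parameters in a plain header map is what lets users pass options such as `columns` without the sink needing a dedicated setting for each one.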

Environment Setup

  • Tested configuration validation against Vector's validation system
  • Verified health check functionality (attempts connection to configured endpoints)
  • All documentation generation and validation checks pass
  • CUE v0.7.0 used for documentation generation

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Notes

Implementation Details

  • Stream Load API: Uses Doris's native Stream Load API for optimal performance and compatibility
  • Authentication: Supports basic authentication with username/password
  • Batching: Configurable batching with event count, byte size, and timeout limits
  • Custom Headers: Support for Doris-specific Stream Load parameters via HTTP headers including:
    • format: Data format specification (json, csv, etc.)
    • read_json_by_line: JSON line-by-line reading mode
    • strip_outer_array: Array handling configuration
    • columns: Column mapping specification
  • Error Handling: Comprehensive error handling with configurable retry logic
  • Health Checks: Validates connectivity and basic authentication
  • Rate Limiting: Built-in rate limiting and adaptive concurrency control
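
To illustrate how the three batching limits listed above interact, here is a small Python sketch. It is an assumption-laden model, not Vector's batching framework: the `Batcher` class, its method names, and the injectable `clock` are all hypothetical. A batch is emitted when the event count is reached, when adding an event would exceed the byte limit, or when the timeout has elapsed since the first event of the batch arrived.

```python
import time


class Batcher:
    """Hypothetical batcher with event-count, byte-size, and timeout limits."""

    def __init__(self, max_events, max_bytes, timeout_secs, clock=time.monotonic):
        self.max_events = max_events
        self.max_bytes = max_bytes
        self.timeout_secs = timeout_secs
        self._clock = clock
        self._events = []
        self._bytes = 0
        self._started = None  # time the first event of the batch arrived

    def _append(self, encoded):
        if not self._events:
            self._started = self._clock()
        self._events.append(encoded)
        self._bytes += len(encoded)

    def _take(self):
        batch = self._events
        self._events, self._bytes, self._started = [], 0, None
        return batch

    def push(self, encoded):
        """Add one encoded event; return a finished batch if a limit is hit."""
        # Byte limit: close the current batch before it would overflow.
        if self._events and self._bytes + len(encoded) > self.max_bytes:
            full = self._take()
            self._append(encoded)
            return full
        self._append(encoded)
        # Event-count limit: close the batch once it is full.
        if len(self._events) >= self.max_events:
            return self._take()
        return None

    def poll(self):
        """Return the pending batch if the timeout has elapsed, else None."""
        if self._events and self._clock() - self._started >= self.timeout_secs:
            return self._take()
        return None
```

The timeout is measured from the first event in the batch, so a slow trickle of events is still flushed within `timeout_secs` rather than waiting for the count or size limit.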

Documentation

  • Added complete CUE documentation for the sink configuration
  • Generated reference documentation automatically using Vector's documentation system
  • Updated service definitions and URL references
  • All documentation validation checks pass (CI=true make check-docs)

Dependencies

  • No new external dependencies added
  • Uses existing Vector HTTP client infrastructure
  • Leverages standard Vector authentication, batching, and request frameworks
  • Follows Vector's established patterns for sink implementation

Code Quality

  • All code formatted with cargo fmt
  • Follows Vector's coding standards and patterns
  • Proper error handling and logging throughout
  • Comprehensive configuration validation

Testing Strategy

  • Configuration validation ensures all options are properly parsed
  • Health check functionality verified through connection attempts
  • Documentation generation confirms all metadata is correctly defined
  • Follows Vector's established testing patterns for sinks

References

@bingquanzhao bingquanzhao requested review from a team as code owners May 28, 2025 16:18
@bits-bot

bits-bot commented May 28, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the domain: sinks, domain: ci, and domain: external docs labels May 28, 2025
@drichards-87
Contributor

Created Jira card for Docs Team review.

Contributor

@maycmlee maycmlee left a comment

Some small suggestions

Contributor

@maycmlee maycmlee left a comment

👍 for docs

@pront
Member

pront commented Jun 25, 2025

Hi @bingquanzhao, thank you for this PR. Please rebase on master and fix merge conflicts. There are 12k affected lines right now.

@pront pront added the meta: awaiting author label Jun 25, 2025
@github-actions github-actions bot removed the meta: awaiting author label Jul 9, 2025
Contributor

@thomasqueirozb thomasqueirozb left a comment

Thanks! Should be good to go realistically. Would only like to talk about the string uri and I'll commit the other required changes myself

@thomasqueirozb
Contributor

It actually looks like integration tests are failing when I run ./scripts/run-integration-tests int doris due to a DB auth issue. I partially fixed the integration tests but didn't debug much further after I hit this issue

@thomasqueirozb thomasqueirozb added the meta: awaiting author label Dec 22, 2025
@github-actions github-actions bot removed the meta: awaiting author label Dec 25, 2025
@bingquanzhao
Contributor Author

Hi @thomasqueirozb, I've addressed the type issue with base_url and fixed the integration test issues.

@freejool

@bingquanzhao Hi, I added the header group_commit: async_mode to my config and got this error from Doris: "label and group_commit can't be set at the same time". Could you provide a switch to disable label generation? (It would trade some integrity for more efficiency.)

@bingquanzhao
Contributor Author

I will add a check: when group_commit is set, the label will not be set.
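
A check of this kind could look roughly like the Python sketch below. It is only an illustration of the idea, not the code that was pushed: the `stream_load_headers` function and the generated label format are hypothetical, and the only behavior taken from this thread is that no label is sent when a `group_commit` header is configured.

```python
import uuid


def stream_load_headers(custom_headers, label_prefix, database, table):
    """Return the headers for one Stream Load request.

    A label is generated only when the user has not configured
    group_commit, because Doris rejects requests that set both.
    """
    headers = dict(custom_headers)

    # HTTP header names are case-insensitive, so compare in lower case.
    configured = {name.lower() for name in headers}

    if "group_commit" not in configured and "label" not in configured:
        # Hypothetical label format: prefix, target, and a random suffix.
        headers["label"] = f"{label_prefix}_{database}_{table}_{uuid.uuid4().hex}"

    return headers


with_group_commit = stream_load_headers(
    {"format": "json", "group_commit": "async_mode"},
    label_prefix="vector",
    database="analytics_db",
    table="user_events",
)
print("label" in with_group_commit)
```

Deriving the decision from the configured headers means users do not need a separate switch: setting `group_commit` is enough to turn label generation off.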

@bingquanzhao
Contributor Author

I have pushed the change you asked for. You can give it a try.

@freejool

freejool commented Jan 4, 2026

Thanks! It works perfectly!

@bingquanzhao
Contributor Author

Hi @thomasqueirozb, do I still need to do anything else?

@thomasqueirozb
Contributor

thomasqueirozb commented Jan 6, 2026

Hi @bingquanzhao, sorry for the delay as I was out for the last two weeks. I have not yet reviewed the latest changes in the PR but for right now no action is needed from your side :)

@bingquanzhao
Contributor Author

Regarding the check-component-docs CI failure:
I ran make generate-component-docs locally, but the doris.cue file has no changes - it remains identical to what's already committed.
Could you help take a look at why this is happening?

@thomasqueirozb thomasqueirozb added this pull request to the merge queue Jan 7, 2026
Merged via the queue into vectordotdev:master with commit eabdd5e Jan 7, 2026
50 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 7, 2026

Labels

domain: ci, domain: external docs, domain: sinks, editorial review, sink: new

9 participants